Statistical Analysis

Gerko Vink

Methodology & Statistics @ Utrecht University

2 Jun 2025

Disclaimer

I owe a debt of gratitude to many people, as the thoughts and code in these slides are the product of years-long development cycles and discussions with my team, friends, colleagues and peers. When someone has contributed to the content of the slides, I have credited their authorship.

Images are either directly linked, or generated with StableDiffusion or DALL-E. That said, there is no information in this presentation that exceeds legal use of copyrighted materials in academic settings, or that should not be part of the public domain.

Warning

You may use any and all content in this presentation - including my name - and submit it as input to generative AI tools, with the following exception:

  • You must ensure that the content is not used for further training of the model

Slide materials and source code

Materials

Recap

Yesterday we covered the following topics:

  • Combining datasets
  • Grouping and aggregating
  • Creating new variables
  • Filtering and sorting data
  • Creating and modifying data groups
  • Clustering data

Today

Today we cover the following topics:

  • Descriptive statistics
  • Cross tables and frequency distributions
  • The \(X^2\) test and measures of association
  • Simple linear regression
  • Running analyses on groups

We use the following packages

library(MASS)
library(dplyr)
library(magrittr)
library(ggplot2)
library(mice)
library(DAAG)
library(car)

set.seed(123)

Linear regression

Linear Model

The linear regression model:

\[y_i=\beta_0+\sum_{j}\beta_{j} x_{ij}+\varepsilon_i, \ \ \ \ \ \ \varepsilon_i\sim N(0, \sigma^2)\] where

  • \(y_i\) is score of individual \(i\) on the numeric dependent variable \(Y\)

  • \(x_{ij}\) is the score of individual \(i\) on predictor \(X_j\)

  • \(\beta_0\) is the intercept

  • \(\beta_j\) is the slope of \(X_j\)

  • \(\varepsilon_{i}\) is the residual (prediction error)

The lm() function

lm(formula, data) # returns an object of class lm
formula              model
y ~ 1                intercept-only
y ~ x1               main effect of x1
y ~ x1 + x2          main effects of x1 and x2
y ~ .                main effects of all predictors
y ~ . - x1           main effects of all predictors, except x1
y ~ x1 + x2 + x1:x2  main effects plus the interaction between x1 and x2
y ~ x1*x2            idem
y ~ .^2              main effects plus pairwise interactions between all predictors
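As a quick check of this shorthand, terms() expands a formula into its individual terms; a small sketch with hypothetical variables y, x1 and x2:

```r
# the * operator expands to both main effects plus their interaction
f <- y ~ x1 * x2
attr(terms(f), "term.labels")
# [1] "x1"    "x2"    "x1:x2"
```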

Simple regression

Continuous predictor

  • observed = fitted + residual

\[y_i=\hat{y}_i+\varepsilon_i\]

  • fitted = intercept + slope times x

\[\hat{y}_i=\beta_0 + \beta_1x_i\]


  • fitted changes with \(\beta_1\) for each unit increase in \(x\)
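These identities are easy to verify; a minimal self-contained sketch using mtcars:

```r
fit <- lm(mpg ~ disp, data = mtcars)
# fitted = intercept + slope * x, reproduced by hand
by_hand <- coef(fit)[1] + coef(fit)[2] * mtcars$disp
all.equal(unname(fitted(fit)), unname(by_hand))          # TRUE
# observed = fitted + residual
all.equal(mtcars$mpg, unname(fitted(fit) + resid(fit)))  # TRUE
```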

Regression line for mpg ~ disp

ggplot(mtcars, aes(disp, mpg)) + 
  geom_point() +
  xlim(0, 500) +
  geom_smooth(method = "lm", se = FALSE, fullrange = TRUE)

Coefficients

Interpretation of the parameter estimates.

fit <- lm(mpg ~ disp, mtcars)

Structure of lm object

str(fit)
List of 12
 $ coefficients : Named num [1:2] 29.5999 -0.0412
  ..- attr(*, "names")= chr [1:2] "(Intercept)" "disp"
 $ residuals    : Named num [1:32] -2.01 -2.01 -2.35 2.43 3.94 ...
  ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
 $ effects      : Named num [1:32] -113.65 -28.44 -1.79 2.65 3.92 ...
  ..- attr(*, "names")= chr [1:32] "(Intercept)" "disp" "" "" ...
 $ rank         : int 2
 $ fitted.values: Named num [1:32] 23 23 25.1 19 14.8 ...
  ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
 $ assign       : int [1:2] 0 1
 $ qr           :List of 5
  ..$ qr   : num [1:32, 1:2] -5.657 0.177 0.177 0.177 0.177 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
  .. .. ..$ : chr [1:2] "(Intercept)" "disp"
  .. ..- attr(*, "assign")= int [1:2] 0 1
  ..$ qraux: num [1:2] 1.18 1.09
  ..$ pivot: int [1:2] 1 2
  ..$ tol  : num 1e-07
  ..$ rank : int 2
  ..- attr(*, "class")= chr "qr"
 $ df.residual  : int 30
 $ xlevels      : Named list()
 $ call         : language lm(formula = mpg ~ disp, data = mtcars)
 $ terms        :Classes 'terms', 'formula'  language mpg ~ disp
  .. ..- attr(*, "variables")= language list(mpg, disp)
  .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:2] "mpg" "disp"
  .. .. .. ..$ : chr "disp"
  .. ..- attr(*, "term.labels")= chr "disp"
  .. ..- attr(*, "order")= int 1
  .. ..- attr(*, "intercept")= int 1
  .. ..- attr(*, "response")= int 1
  .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. ..- attr(*, "predvars")= language list(mpg, disp)
  .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. ..- attr(*, "names")= chr [1:2] "mpg" "disp"
 $ model        :'data.frame':  32 obs. of  2 variables:
  ..$ mpg : num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
  ..$ disp: num [1:32] 160 160 108 258 360 ...
  ..- attr(*, "terms")=Classes 'terms', 'formula'  language mpg ~ disp
  .. .. ..- attr(*, "variables")= language list(mpg, disp)
  .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. ..$ : chr [1:2] "mpg" "disp"
  .. .. .. .. ..$ : chr "disp"
  .. .. ..- attr(*, "term.labels")= chr "disp"
  .. .. ..- attr(*, "order")= int 1
  .. .. ..- attr(*, "intercept")= int 1
  .. .. ..- attr(*, "response")= int 1
  .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. .. ..- attr(*, "predvars")= language list(mpg, disp)
  .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. .. ..- attr(*, "names")= chr [1:2] "mpg" "disp"
 - attr(*, "class")= chr "lm"

Summary of lm object

summary(fit)

Call:
lm(formula = mpg ~ disp, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.8922 -2.2022 -0.9631  1.6272  7.2305 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 29.599855   1.229720  24.070  < 2e-16 ***
disp        -0.041215   0.004712  -8.747 9.38e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.251 on 30 degrees of freedom
Multiple R-squared:  0.7183,    Adjusted R-squared:  0.709 
F-statistic: 76.51 on 1 and 30 DF,  p-value: 9.38e-10

Structure summary lm object

str(summary(fit))
List of 11
 $ call         : language lm(formula = mpg ~ disp, data = mtcars)
 $ terms        :Classes 'terms', 'formula'  language mpg ~ disp
  .. ..- attr(*, "variables")= language list(mpg, disp)
  .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:2] "mpg" "disp"
  .. .. .. ..$ : chr "disp"
  .. ..- attr(*, "term.labels")= chr "disp"
  .. ..- attr(*, "order")= int 1
  .. ..- attr(*, "intercept")= int 1
  .. ..- attr(*, "response")= int 1
  .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. ..- attr(*, "predvars")= language list(mpg, disp)
  .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. ..- attr(*, "names")= chr [1:2] "mpg" "disp"
 $ residuals    : Named num [1:32] -2.01 -2.01 -2.35 2.43 3.94 ...
  ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
 $ coefficients : num [1:2, 1:4] 29.59985 -0.04122 1.22972 0.00471 24.07041 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:2] "(Intercept)" "disp"
  .. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
 $ aliased      : Named logi [1:2] FALSE FALSE
  ..- attr(*, "names")= chr [1:2] "(Intercept)" "disp"
 $ sigma        : num 3.25
 $ df           : int [1:3] 2 30 2
 $ r.squared    : num 0.718
 $ adj.r.squared: num 0.709
 $ fstatistic   : Named num [1:3] 76.5 1 30
  ..- attr(*, "names")= chr [1:3] "value" "numdf" "dendf"
 $ cov.unscaled : num [1:2, 1:2] 1.43e-01 -4.85e-04 -4.85e-04 2.10e-06
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:2] "(Intercept)" "disp"
  .. ..$ : chr [1:2] "(Intercept)" "disp"
 - attr(*, "class")= chr "summary.lm"

Extracting lm list elements:

Function / Subsetting        Output
coef(fit) / fit$coef         coefficients
fitted(fit) / fit$fitted     fitted values
resid(fit) / fit$resid       residuals
summary(fit)$r.squared       R-squared statistic
fit$coef
## (Intercept)        disp 
## 29.59985476 -0.04121512
summary(fit)$r.squared
## [1] 0.7183433

Categorical predictor

Dummy variables

Categorical predictors are converted into dummy variables:

  • each category has a dummy with value 1 for that category, and 0 otherwise

  • except for the reference category (0 on all dummies)

  • all categories are compared to the reference category

Reference category of \(z\) is \(a\)

Interpreting dummies

Model for categorical \(Z\) with categories \(a, b, c\):

\[\hat{y}=\beta_0+\beta_1zb+\beta_2zc\]


parameters interpretation
\(\beta_0\) predicted mean category \(a\) (reference category)
\(\beta_0+\beta_1\) predicted mean category \(b\)
\(\beta_0+\beta_2\) predicted mean category \(c\)

Example

Interpret the parameter estimates of model mpg ~ factor(am)

  • am = 0 is automatic and am = 1 is manual transmission

  • reference category is am = 0

coef(lm(mpg ~ factor(am), mtcars))
(Intercept) factor(am)1 
  17.147368    7.244939 
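The coefficients match the observed group means, which we can verify directly:

```r
# group means of mpg per transmission type
tapply(mtcars$mpg, mtcars$am, mean)
#        0        1 
# 17.14737 24.39231 
# intercept = mean for am = 0; intercept + slope = mean for am = 1
fit.am <- lm(mpg ~ factor(am), data = mtcars)
sum(coef(fit.am))  # 24.39231
```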


Model fit

A simple model

boys.fit <- 
  na.omit(boys) %$% # Extremely wasteful: listwise deletion discards many cases
  lm(age ~ reg)
boys.fit

Call:
lm(formula = age ~ reg)

Coefficients:
(Intercept)      regeast      regwest     regsouth      regcity  
    14.7109      -0.8168      -0.7044      -0.6913      -0.6663  
boys %>% na.omit() %$% aggregate(age, list(reg), mean)
  Group.1        x
1   north 14.71094
2    east 13.89410
3    west 14.00657
4   south 14.01965
5    city 14.04460

Plotting the model

means <- boys %>% na.omit() %>% group_by(reg) %>% summarise(age = mean(age))
ggplot(na.omit(boys), aes(x = reg, y = age)) + 
  geom_point(color = "grey") + 
  geom_point(data = means, stat = "identity", size = 3)

Model parameters

boys.fit %>%
  summary()

Call:
lm(formula = age ~ reg)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.8519 -2.5301  0.0254  2.2274  6.2614 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  14.7109     0.5660  25.993   <2e-16 ***
regeast      -0.8168     0.7150  -1.142    0.255    
regwest      -0.7044     0.6970  -1.011    0.313    
regsouth     -0.6913     0.6970  -0.992    0.322    
regcity      -0.6663     0.9038  -0.737    0.462    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.151 on 218 degrees of freedom
Multiple R-squared:  0.006745,  Adjusted R-squared:  -0.01148 
F-statistic: 0.3701 on 4 and 218 DF,  p-value: 0.8298

Is it a good model?

boys.fit %>%
  anova()
Analysis of Variance Table

Response: age
           Df Sum Sq Mean Sq F value Pr(>F)
reg         4   14.7  3.6747  0.3701 0.8298
Residuals 218 2164.6  9.9293               

It is not a very informative model. The F-test is not significant: the variance explained by reg is no larger than we would expect by chance, i.e. the residual variation dominates.

The outcome age does not change significantly across the levels of reg.

Model factors

boys.fit %>%
  model.matrix() %>%
  head(n = 10)
   (Intercept) regeast regwest regsouth regcity
1            1       0       0        0       0
2            1       0       0        1       0
3            1       0       0        1       0
4            1       1       0        0       0
5            1       0       1        0       0
6            1       1       0        0       0
7            1       0       0        0       0
8            1       0       1        0       0
9            1       0       0        1       0
10           1       0       1        0       0

R expands the categorical variable for us

  • it dummy-codes the 5 categories into 4 dummies (and an intercept).

Post hoc comparisons

coef <- boys.fit %>% aov() %>% summary.lm()
coef

Call:
aov(formula = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.8519 -2.5301  0.0254  2.2274  6.2614 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  14.7109     0.5660  25.993   <2e-16 ***
regeast      -0.8168     0.7150  -1.142    0.255    
regwest      -0.7044     0.6970  -1.011    0.313    
regsouth     -0.6913     0.6970  -0.992    0.322    
regcity      -0.6663     0.9038  -0.737    0.462    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.151 on 218 degrees of freedom
Multiple R-squared:  0.006745,  Adjusted R-squared:  -0.01148 
F-statistic: 0.3701 on 4 and 218 DF,  p-value: 0.8298

Post hoc comparisons

Without p-value adjustment

na.omit(boys) %$% pairwise.t.test(age, reg, p.adj = "none")

    Pairwise comparisons using t tests with pooled SD 

data:  age and reg 

      north east west south
east  0.25  -    -    -    
west  0.31  0.85 -    -    
south 0.32  0.83 0.98 -    
city  0.46  0.86 0.96 0.98 

P value adjustment method: none 

Post hoc comparisons

With p-values adjusted using the Bonferroni correction

na.omit(boys) %$% pairwise.t.test(age, reg, p.adj = "bonf")

    Pairwise comparisons using t tests with pooled SD 

data:  age and reg 

      north east west south
east  1     -    -    -    
west  1     1    -    -    
south 1     1    1    -    
city  1     1    1    1    

P value adjustment method: bonferroni 

Post hoc comparisons

Manually calculated

p.val <- coef$coefficients
p.adjust(p.val[, "Pr(>|t|)"], method = "bonferroni")
 (Intercept)      regeast      regwest     regsouth      regcity 
5.077098e-68 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 

If you have trouble reading scientific notation, 5.077098e-68 means the following

\[5.077098\text{e-68} = 5.077098 \times 10^{-68} = 5.077098 \times \left(\frac{1}{10}\right)^{68}\]

This indicates that the decimal point should be moved 68 places to the left:

\[5.077098\text{e-68} = 0.\underbrace{000\ldots000}_{67\ \text{zeros}}5077098\]

AIC

Akaike’s An Information Criterion

boys.fit %>% 
  AIC()
[1] 1151.684

What is AIC

AIC comes from information theory and can be used for model selection. The AIC estimates the relative information that is lost when a statistical model is used to approximate the process that generated the data. In other words: AIC measures the relative fit of the model to the data.

  • The better the fit, the smaller the loss of information
  • AIC works on the log-likelihood scale:
    • \(\text{log}(0) = -\infty\), \(\text{log}(1) = 0\), etc.
  • the lower the AIC (the closer to \(-\infty\)), the better
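For lm objects, the definition \(\text{AIC} = -2\log L + 2k\) can be verified by hand; a sketch with the mtcars model:

```r
fit <- lm(mpg ~ disp, data = mtcars)
ll <- logLik(fit)
k  <- attr(ll, "df")  # number of estimated parameters (incl. sigma)
all.equal(-2 * as.numeric(ll) + 2 * k, AIC(fit))  # TRUE
```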

Model comparison

A new model

Let’s add predictor hgt to the model:

boys.fit2 <- 
  na.omit(boys) %$%
  lm(age ~ reg + hgt)

boys.fit %>% AIC()
[1] 1151.684
boys.fit2 %>% AIC()
[1] 836.3545

Another model

Let’s add wgt to the model

boys.fit3 <- 
  na.omit(boys) %$%
  lm(age ~ reg + hgt + wgt)

And another model

Let’s add wgt and the interaction between wgt and hgt to the model

boys.fit4 <- 
  na.omit(boys) %$%
  lm(age ~ reg + hgt * wgt)

is equivalent to

boys.fit4 <- 
  na.omit(boys) %$%
  lm(age ~ reg + hgt + wgt + hgt:wgt)

Model comparison

boys.fit %>% AIC()
[1] 1151.684
boys.fit2 %>% AIC()
[1] 836.3545
boys.fit3 %>% AIC()
[1] 822.4164
boys.fit4 %>% AIC()
[1] 823.0386

Another form of model comparison

anova(boys.fit, boys.fit2, boys.fit3, boys.fit4)
Analysis of Variance Table

Model 1: age ~ reg
Model 2: age ~ reg + hgt
Model 3: age ~ reg + hgt + wgt
Model 4: age ~ reg + hgt * wgt
  Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
1    218 2164.59                                    
2    217  521.64  1   1642.94 731.8311 < 2.2e-16 ***
3    216  485.66  1     35.98  16.0276 8.595e-05 ***
4    215  482.67  1      2.99   1.3324    0.2497    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Inspect boys.fit3

boys.fit3 %>% anova()
Analysis of Variance Table

Response: age
           Df  Sum Sq Mean Sq  F value    Pr(>F)    
reg         4   14.70    3.67   1.6343    0.1667    
hgt         1 1642.94 1642.94 730.7065 < 2.2e-16 ***
wgt         1   35.98   35.98  16.0029 8.688e-05 ***
Residuals 216  485.66    2.25                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Inspect boys.fit4

boys.fit4 %>% anova()
Analysis of Variance Table

Response: age
           Df  Sum Sq Mean Sq  F value    Pr(>F)    
reg         4   14.70    3.67   1.6368    0.1661    
hgt         1 1642.94 1642.94 731.8311 < 2.2e-16 ***
wgt         1   35.98   35.98  16.0276 8.595e-05 ***
hgt:wgt     1    2.99    2.99   1.3324    0.2497    
Residuals 215  482.67    2.24                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

It seems that reg and the interaction hgt:wgt are redundant

Remove reg

boys.fit5 <- 
  na.omit(boys) %$%
  lm(age ~ hgt + wgt)

Let’s revisit the comparison

anova(boys.fit, boys.fit2, boys.fit3, boys.fit5)
Analysis of Variance Table

Model 1: age ~ reg
Model 2: age ~ reg + hgt
Model 3: age ~ reg + hgt + wgt
Model 4: age ~ hgt + wgt
  Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
1    218 2164.59                                    
2    217  521.64  1   1642.94 730.7065 < 2.2e-16 ***
3    216  485.66  1     35.98  16.0029 8.688e-05 ***
4    220  492.43 -4     -6.77   0.7529    0.5571    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Yet boys.fit5 is the better choice: it fits the data about as well as the previous model, but with fewer parameters.
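A direct nested comparison makes the same point; a sketch, assuming the fitted models from the previous slides are still in memory:

```r
# F-test for the four reg dummies, given hgt and wgt
anova(boys.fit5, boys.fit3)
# AIC agrees: the smaller model is not worse
AIC(boys.fit5, boys.fit3)
```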

Stepwise regression

We start with the full model, which contains main effects for all predictors.

The most straightforward way to go about this is by specifying the following model:

full.model <- lm(age ~ ., data = na.omit(boys))
full.model

Call:
lm(formula = age ~ ., data = na.omit(boys))

Coefficients:
(Intercept)          hgt          wgt          bmi           hc        gen.L  
   2.556051     0.059987    -0.009846     0.142162    -0.024086     1.287455  
      gen.Q        gen.C        gen^4        phb.L        phb.Q        phb.C  
  -0.006861    -0.187256     0.034186     1.552398     0.499620     0.656277  
      phb^4        phb^5           tv      regeast      regwest     regsouth  
  -0.094722    -0.113686     0.074321    -0.222249    -0.233307    -0.258771  
    regcity  
   0.188423  

Stepwise regression - continued

We can then specify the stepwise model. In this case we choose direction = "both".

step.model <- step(full.model, direction = "both", 
                      trace = FALSE)
step.model

Call:
lm(formula = age ~ hgt + bmi + phb + tv, data = na.omit(boys))

Coefficients:
(Intercept)          hgt          bmi        phb.L        phb.Q        phb.C  
    2.07085      0.05482      0.10742      2.70328      0.28877      0.48492  
      phb^4        phb^5           tv  
   -0.06225     -0.15167      0.07957  

Other options are

  • forward: fit all univariate models, add the best predictor and continue.
  • backward: fit the full model, eliminate the worst predictor and continue.

Summary

step.model %>% summary

Call:
lm(formula = age ~ hgt + bmi + phb + tv, data = na.omit(boys))

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5077 -0.7144 -0.0814  0.6850  3.1724 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.070854   1.576365   1.314  0.19036    
hgt          0.054822   0.009986   5.490 1.13e-07 ***
bmi          0.107415   0.030135   3.564  0.00045 ***
phb.L        2.703275   0.437108   6.184 3.12e-09 ***
phb.Q        0.288768   0.211836   1.363  0.17426    
phb.C        0.484921   0.202856   2.390  0.01769 *  
phb^4       -0.062246   0.208472  -0.299  0.76555    
phb^5       -0.151667   0.240599  -0.630  0.52912    
tv           0.079573   0.019600   4.060 6.89e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.172 on 214 degrees of freedom
Multiple R-squared:  0.8651,    Adjusted R-squared:   0.86 
F-statistic: 171.5 on 8 and 214 DF,  p-value: < 2.2e-16

Stepwise regression - AIC

full.model <- lm(age ~ ., data = na.omit(boys))
step.model <- MASS::stepAIC(full.model, direction = "both", 
                      trace = FALSE)
step.model

Call:
lm(formula = age ~ hgt + bmi + phb + tv, data = na.omit(boys))

Coefficients:
(Intercept)          hgt          bmi        phb.L        phb.Q        phb.C  
    2.07085      0.05482      0.10742      2.70328      0.28877      0.48492  
      phb^4        phb^5           tv  
   -0.06225     -0.15167      0.07957  

Influence of cases

dfbeta() calculates the change in each coefficient when a case is deleted from the data.

step.model %>%
  dfbeta() %>%
  head(n = 7)
      (Intercept)           hgt           bmi        phb.L         phb.Q
3279 -0.142127090  0.0010828575 -0.0021654085 -0.021523825 -0.0006087783
3283 -0.005201914  0.0001127452 -0.0014100750  0.009068652 -0.0147293625
3296 -0.081439010  0.0004244061  0.0002603379 -0.009247562 -0.0071293767
3321 -0.084630826  0.0004186923  0.0008222594 -0.005498486 -0.0059276580
3323  0.154386486 -0.0004476521 -0.0052964367  0.042872835 -0.0213761347
3327 -0.046095468  0.0001081361  0.0011150546 -0.004047875 -0.0085461280
3357 -0.045933397  0.0001345605  0.0009770371 -0.005087053 -0.0062604484
           phb.C        phb^4        phb^5            tv
3279 0.007427067 -0.004226982 -0.001213833  0.0001061106
3283 0.011761595 -0.005498411  0.001164307  0.0007051901
3296 0.008538302 -0.003688611  0.000842872  0.0002494383
3321 0.006839217 -0.003341741  0.000866896 -0.0002569914
3323 0.011758694 -0.006659755  0.000493897  0.0012317886
3327 0.007807966 -0.002901270  0.001499505  0.0003376220
3357 0.006152384 -0.002274777  0.001148552  0.0002329346
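dfbeta() returns raw changes in the coefficients; the scaled variant dfbetas() expresses each change in standard-error units, which is easier to screen for influential cases. A sketch:

```r
# change per coefficient, in standard-error units
infl <- dfbetas(step.model)
# a common rule of thumb flags |dfbetas| > 2 / sqrt(n)
flag <- abs(infl) > 2 / sqrt(nobs(step.model))
which(flag, arr.ind = TRUE) %>% head()
```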

Prediction

Fitted values

Let’s use the simpler anscombe data example

fit <- anscombe %$% lm(y1 ~ x1)

y_hat <- 
  fit %>%
  fitted.values()

The residuals (with flipped sign, since residual = observed minus fitted) are then calculated as

y_hat - anscombe$y1
          1           2           3           4           5           6 
-0.03900000  0.05081818  1.92127273 -1.30909091  0.17109091  0.04136364 
          7           8           9          10          11 
-1.23936364  0.74045455 -1.83881818  1.68072727 -0.17945455 

Predict new values

If we introduce new values for the predictor x1, we can generate predicted values from the model

new.x1 <- data.frame(x1 = 1:20)
fit %>% predict(newdata = new.x1)
        1         2         3         4         5         6         7         8 
 3.500182  4.000273  4.500364  5.000455  5.500545  6.000636  6.500727  7.000818 
        9        10        11        12        13        14        15        16 
 7.500909  8.001000  8.501091  9.001182  9.501273 10.001364 10.501455 11.001545 
       17        18        19        20 
11.501636 12.001727 12.501818 13.001909 

Predicted values lie exactly on the regression line

pred <- fit %>% predict(newdata = new.x1)
lm(pred ~ new.x1$x1)$coefficients
(Intercept)   new.x1$x1 
  3.0000909   0.5000909 
fit$coefficients
(Intercept)          x1 
  3.0000909   0.5000909 

Prediction intervals

fit %>% predict(interval = "prediction")
         fit      lwr       upr
1   8.001000 5.067072 10.934928
2   7.000818 4.066890  9.934747
3   9.501273 6.390801 12.611745
4   7.500909 4.579129 10.422689
5   8.501091 5.531014 11.471168
6  10.001364 6.789620 13.213107
7   6.000636 2.971271  9.030002
8   5.000455 1.788711  8.212198
9   9.001182 5.971816 12.030547
10  6.500727 3.530650  9.470804
11  5.500545 2.390073  8.611017

A prediction interval reflects the uncertainty around a single value. The confidence interval reflects the uncertainty around the mean prediction values.
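This difference shows in the interval widths: the prediction interval is wider than the confidence interval at every value of the predictor. A sketch:

```r
fit <- anscombe %$% lm(y1 ~ x1)
pi <- predict(fit, interval = "prediction")  # warns: predicting on training data
ci <- predict(fit, interval = "confidence")
all((pi[, "upr"] - pi[, "lwr"]) > (ci[, "upr"] - ci[, "lwr"]))  # TRUE
```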

How many cases are used?

na.omit(boys) %$%
  lm(age ~ reg + hgt * wgt) %>%
  nobs()
[1] 223

If we had not used na.omit():

boys %$%
  lm(age ~ reg + hgt * wgt) %>%
  nobs()
[1] 724

Confidence intervals?

95% confidence interval

If an infinite number of samples were drawn and CIs computed, the true population mean \(\mu\) would be covered by at least 95% of these intervals.

\[95\%~\text{CI} = \bar{x} \pm t_{(1-\alpha/2,\ df)} \times SEM\]

Example

x.bar <- 7.6 # sample mean
SEM   <- 2.1 # standard error of the mean
n     <- 11 # sample size
df    <- n-1 # degrees of freedom
alpha <- .15 # significance level (note: this yields an 85% CI)
t.crit <- qt(1 - alpha / 2, df) # t(1 - alpha / 2) for df = 10
c(x.bar - t.crit * SEM, x.bar + t.crit * SEM) 
[1]  4.325605 10.874395

    Neyman, J. (1934). On the Two Different Aspects of the Representative Method: 
    The Method of Stratified Sampling and the Method of Purposive Selection. 
    Journal of the Royal Statistical Society, Vol. 97, No. 4 (1934), pp. 558-625

Misconceptions

Confidence intervals are frequently misunderstood; even well-established researchers sometimes misinterpret them.

  1. A realised 95% CI does not mean:
  • that there is a 95% probability the population parameter lies within the interval

  • that there is a 95% probability that the interval covers the population parameter

    Once an experiment is done and an interval is calculated, the interval either covers or does not cover the parameter value. Probability is no longer involved.

    The 95% probability only has to do with the estimation procedure.

  2. A 95% confidence interval does not mean that 95% of the sample data lie within the interval.
  3. A confidence interval is not a range of plausible values for the sample mean, though it may be understood as an estimate of plausible values for the population parameter.
  4. A particular confidence interval of 95% calculated from an experiment does not mean that there is a 95% probability of a sample mean from a repeat of the experiment falling within this interval.

Confidence intervals

100 simulated samples from a population with \(\mu = 0\) and \(\sigma^2=1\). Out of 100 samples, only 5 samples have confidence intervals that do not cover the population mean.
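The coverage statement can be checked with a quick simulation (a sketch; any finite run will only approximate the nominal 95%):

```r
# Simulate repeated sampling from N(0, 1) and record whether
# each sample's 95% CI covers the true mean mu = 0
set.seed(123)
covers <- replicate(1000, {
  x  <- rnorm(50)               # one sample of size 50
  ci <- t.test(x)$conf.int      # 95% CI for the mean
  ci[1] < 0 & ci[2] > 0         # does the interval cover mu = 0?
})
mean(covers)                    # proportion of intervals covering mu; close to 0.95
```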

So far

At this point we have covered the following models:

  • Simple linear regression (SLR)

\[y = \beta_0 + \beta_1 x + \varepsilon\]

The relationship between a numerical outcome and a numerical or categorical predictor

  • Multiple linear regression (MLR)

\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon\]

The relationship between a numerical outcome and multiple numerical or categorical predictors

What remains

We have not yet covered how to handle outcomes that are not numerical, or how to deal with predictors that are nonlinear or have a strict dependency structure.

What we have learned

We have covered the following topics last week:

  • fit SLR and MLR models
  • select MLR models
  • interpret model parameters
  • perform hypothesis test on slope and intercept parameters
  • perform hypothesis test for the whole regression model
  • calculate confidence intervals for regression parameters
  • obtain prediction intervals for fitted values
  • study the influence of single cases
  • study the validity of linear regression assumptions:
    • linearity, constant residual variance
  • study the residuals, leverage and Cook’s distance

Rewriting what we know

Instead of modeling

\[y = \beta_0 + \beta_1 x + \varepsilon\]

we can also consider \[\mathbb{E}[y] = \beta_0 + \beta_1 x\]

They’re the same. Different notation, different framework.

The upside is that we can now use a function for the expectation \(\mathbb{E}\) to allow for transformations. This would enable us to change \(\mathbb{E}[y]\) such that \(f(\mathbb{E}[y])\) has a linear relation with \(x\).

This is what we will be doing today

Illustration of the problem

A simulated data set

To further illustrate why the linear model is not an appropriate model for discrete data I propose the following simple simulated data set:

set.seed(123)
simulated <- data.frame(discrete = c(rep(0, 50), rep(1, 50)),
                        continuous = c(rnorm(50, 10, 3), rnorm(50, 15, 3)))

simulated %>% summary
    discrete     continuous    
 Min.   :0.0   Min.   : 4.100  
 1st Qu.:0.0   1st Qu.: 9.656  
 Median :0.5   Median :12.904  
 Mean   :0.5   Mean   :12.771  
 3rd Qu.:1.0   3rd Qu.:15.570  
 Max.   :1.0   Max.   :21.562  

This data allows us to illustrate modeling the relation between the discrete outcome and the continuous predictor with logistic regression.

Remember that fixing the random seed allows for a replicable random number generator sequence.

Visualizing simulated data

simulated %>% ggplot(aes(x = continuous, y = discrete)) +
  geom_point()

Modeling simulated with lm

simulated %>% ggplot(aes(x = continuous, y = discrete)) +
  geom_point() + geom_smooth(method = "lm", se = FALSE, color = "orange") 

The orange line represents the lm linear regression line. It is not a good representation for our data, as it assumes the data are continuous and projects values outside of the range of the observed data.

Modeling simulated with glm

simulated %>% ggplot(aes(x = continuous, y = discrete)) +
  geom_point() + geom_smooth(method = "lm", se = FALSE, color = "orange") +
  geom_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE) 

The blue glm logistic regression line represents this data infinitely better than the orange lm line. It assumes the data to be 0 or 1 and does not project values outside of the range of the observed data.

How does this work?

Generalized linear modeling

There is a very general way of addressing this type of problem in regression. The models that use this general way are called generalized linear models (GLMs).

Every generalized linear model has the following three characteristics:

  1. A probability distribution that describes the outcome
  2. A linear predictor model
  3. A link function that relates the linear predictor to the parameter of the outcome’s probability distribution.

The linear predictor model in (2) is \[\eta = \bf{X}\beta\] where \(\eta\) denotes a linear predictor and the link function in (3) is \[\bf{X}\beta = g(\mu)\] The technique to model a binary outcome based on a set of continuous or discrete predictors is called logistic regression. Logistic regression is an example of a generalized linear model.

Modeling the odds

Odds are a way of quantifying the probability of an event \(E\)

The odds for an event \(E\) are \[\text{odds}(E) = \frac{P(E)}{P(E^c)} = \frac{P(E)}{1 - P(E)}\] The odds of getting heads in a coin toss is

\[\text{odds}(\text{heads}) = \frac{P(\text{heads})}{P(\text{tails})} = \frac{P(\text{heads})}{1 - P(\text{heads})}\] For a fair coin, this would result in

\[\text{odds}(\text{heads}) = \frac{.5}{1 - .5} = 1\]

Another odds example

The game Lingo has 44 balls: 36 blue, 6 red and 2 green balls

  • The odds of a player choosing a blue ball are \[\text{odds}(\text{blue}) = \frac{36}{8} = \frac{36/44}{8/44} = \frac{.8182}{.1818} = 4.5\]
  • The odds of a player choosing a red ball are \[\text{odds}(\text{red}) = \frac{6}{38} = \frac{6/44}{38/44} = \frac{.1364}{.8636}\approx .16\]
  • The odds of a player choosing a green ball are \[\text{odds}(\text{green}) = \frac{2}{42} = \frac{2/44}{42/44} = \frac{.0455}{.9545}\approx .05\]

Odds of 1 indicate that the event is equally likely to occur or not occur. Odds < 1 indicate that the event is less likely to occur than not. Odds > 1 indicate that the event is more likely to occur than not.
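The Lingo odds above can be reproduced directly from the probabilities:

```r
# Probability of drawing each colour (36 blue, 6 red, 2 green out of 44)
p <- c(blue = 36, red = 6, green = 2) / 44

# Odds: probability of the event over probability of its complement
odds <- p / (1 - p)
round(odds, 2)
#  blue   red green
#  4.50  0.16  0.05
```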

GLM’s continued

Remember that \[y = \beta_0 + \beta_1 x + \varepsilon,\]

and that \[\mathbb{E}[y] = \beta_0 + \beta_1 x.\]

As a result \[y = \mathbb{E}[y] + \varepsilon.\]

and residuals do not need to be normal (heck, \(y\) probably isn’t, so why should \(\epsilon\) be?)

Logistic regression

Logistic regression is a GLM used to model a binary categorical variable using numerical and categorical predictors.

In logistic regression we assume that the true data generating model for the outcome variable follows a binomial distribution.

  • it is therefore intuitive to think of logistic regression as modeling the probability of success \(p\) for any given set of predictors.

How

We specify a reasonable link that connects \(\eta\) to \(p\). Most common in logistic regression is the logit link

\[\text{logit}(p)=\log\left(\frac{p}{1-p}\right), \text{ for } 0 < p < 1\] We might recognize \(\frac{p}{1-p}\) as the odds.
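In R the logit and its inverse are available as qlogis() and plogis(), so there is no need to code log(p / (1 - p)) by hand:

```r
p <- 0.8
eta <- qlogis(p)     # logit: log(p / (1 - p))
log(p / (1 - p))     # identical to qlogis(p)
plogis(eta)          # inverse logit maps eta back to p = 0.8
```

Using plogis() instead of a manual exp(eta) / (1 + exp(eta)) is also numerically safer for very large or very small values of eta.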

\(\log(\text{odds})\) explained

Now if we visualize the relation between our predictor(s) and the log-odds:

Logistic regression

Logistic regression

With linear regression we had the Sum of Squares (SS). Its logistic counterpart is the Deviance (D).

  • Deviance quantifies how far the observed values deviate from the values expected under the model.

With logistic regression we aim to maximize the likelihood, which is equivalent to minimizing the deviance.

The likelihood is the (joint) probability of the observed values, given the current model parameters.

In normally distributed data: \(\text{SS}=\text{D}\).
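This equivalence can be checked directly in R: for a Gaussian model, deviance() of an lm fit equals its residual sum of squares (a sketch on simulated data):

```r
# Fit a linear model to simulated data
set.seed(123)
x <- rnorm(30)
y <- 2 + 3 * x + rnorm(30)
fit <- lm(y ~ x)

deviance(fit)           # deviance of the fitted model
sum(residuals(fit)^2)   # residual sum of squares: identical
```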

The logistic regression model

Remember the three characteristics for every generalized linear model:

  1. A probability distribution that describes the outcome
  2. A linear predictor model
  3. A link function that relates the linear predictor to the parameter of the outcome’s probability distribution.

For the logistic model this gives us:

  1. \(y_i \sim \text{Binom}(p_i)\)
  2. \(\eta = \beta_0 + \beta_1x_1 + \dots + \beta_nx_n\)
  3. \(\text{logit}(p) = \eta\)

Simple substitution brings us at

\[p_i = \frac{\text{exp}(\eta)}{1+\text{exp}(\eta)} = \frac{\text{exp}(\beta_0 + \beta_1x_{1,i} + \dots + \beta_nx_{n,i})}{1+\text{exp}(\beta_0 + \beta_1x_{1,i} + \dots + \beta_nx_{n,i})}\]
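The substitution can be verified numerically for an arbitrary linear predictor (the coefficients below are made up for illustration):

```r
# Illustrative coefficients and predictor value
beta0 <- -6.5
beta1 <- 5.6
x <- 1.2

eta <- beta0 + beta1 * x           # linear predictor
p <- exp(eta) / (1 + exp(eta))     # manual inverse logit
all.equal(p, plogis(eta))          # TRUE: plogis() is the inverse logit
```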

Fitting a logistic regression

The anesthetic data

anesthetic %>% head(n = 10)
   move conc    logconc nomove
1     0  1.0  0.0000000      1
2     1  1.2  0.1823216      0
3     0  1.4  0.3364722      1
4     1  1.4  0.3364722      0
5     1  1.2  0.1823216      0
6     0  2.5  0.9162907      1
7     0  1.6  0.4700036      1
8     1  0.8 -0.2231436      0
9     0  1.6  0.4700036      1
10    1  1.4  0.3364722      0

Thirty patients were given an anesthetic agent maintained at a predetermined level (conc) for 15 minutes before making an incision. It was then noted whether the patient moved, i.e. jerked or twisted.

Fitting a logistic regression model

Fitting a glm in R is not much different from fitting an lm. We do, however, need to specify what type of glm to use by specifying both the family and the type of link function we need.

For logistic regression we need the binomial family as the binomial distribution is the probability distribution that describes our outcome. We also use the logit link, which is the default for the binomial glm family.

fit <- anesthetic %$% 
  glm(nomove ~ conc, family = binomial(link="logit"))
fit

Call:  glm(formula = nomove ~ conc, family = binomial(link = "logit"))

Coefficients:
(Intercept)         conc  
     -6.469        5.567  

Degrees of Freedom: 29 Total (i.e. Null);  28 Residual
Null Deviance:      41.46 
Residual Deviance: 27.75    AIC: 31.75

The model parameters

fit %>% summary

Call:
glm(formula = nomove ~ conc, family = binomial(link = "logit"))

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept)   -6.469      2.418  -2.675  0.00748 **
conc           5.567      2.044   2.724  0.00645 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 41.455  on 29  degrees of freedom
Residual deviance: 27.754  on 28  degrees of freedom
AIC: 31.754

Number of Fisher Scoring iterations: 5

The regression parameters

fit %>% summary %>% .$coefficients
             Estimate Std. Error   z value    Pr(>|z|)
(Intercept) -6.468675   2.418470 -2.674697 0.007479691
conc         5.566762   2.043591  2.724010 0.006449448

With every unit increase in concentration conc, the log odds of not moving increases by 5.5667617. This increase can be considered different from zero, as the p-value is 0.0064494.

In other words: an increase in conc will lower the probability of moving. We can verify this by modeling move instead of nomove:

anesthetic %$% 
  glm(move ~ conc, family = binomial(link="logit")) %>%
  summary %>% .$coefficients
             Estimate Std. Error   z value    Pr(>|z|)
(Intercept)  6.468675   2.418470  2.674697 0.007479691
conc        -5.566762   2.043591 -2.724010 0.006449448
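To translate the log-odds scale into something easier to communicate, we can compute predicted probabilities and an odds ratio. A sketch, with the estimates hard-coded from the summary above:

```r
# Coefficients copied from the fitted nomove ~ conc model above
b0 <- -6.468675
b1 <- 5.566762

# Predicted probability of not moving at two concentrations
plogis(b0 + b1 * 1.0)   # at conc = 1.0: roughly 0.29
plogis(b0 + b1 * 1.5)   # at conc = 1.5: roughly 0.87

# The slope as an odds ratio: each unit increase in conc
# multiplies the odds of not moving by exp(b1)
exp(b1)
```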

However..

Error

library(caret)
pred <- fit %>% predict(type = "response")
confusionMatrix(data = factor(as.numeric(pred > 0.5), labels = c("move", "nomove")),
                reference = factor(anesthetic$nomove, labels = c("move", "nomove")))
Confusion Matrix and Statistics

          Reference
Prediction move nomove
    move     10      2
    nomove    4     14
                                          
               Accuracy : 0.8             
                 95% CI : (0.6143, 0.9229)
    No Information Rate : 0.5333          
    P-Value [Acc > NIR] : 0.002316        
                                          
                  Kappa : 0.5946          
                                          
 Mcnemar's Test P-Value : 0.683091        
                                          
            Sensitivity : 0.7143          
            Specificity : 0.8750          
         Pos Pred Value : 0.8333          
         Neg Pred Value : 0.7778          
             Prevalence : 0.4667          
         Detection Rate : 0.3333          
   Detection Prevalence : 0.4000          
      Balanced Accuracy : 0.7946          
                                          
       'Positive' Class : move            
                                          

Issue

With the error of the model (or lack thereof) comes a problem.

  1. Is the (lack of) error due to the modeling?
  2. Would another model give us a different error?
  3. Is the (lack of error) due to the data?
  4. Would other data give us a different error?
    If (1) and (2) are the case → model selection is needed to improve the model.
    But what if the better model is only due to our data?
    If (3) and (4) are the case → we need other data to validate that our model is reliable.

Some concepts

Training
If the model will only fit well on the data it has been trained on, then we are overfitting. We have then successfully modeled not only the data, but also (much of) the noise.

  • overfitted models have little bias, but high variance
  • the opposite, underfitting occurs when the model is too simple and cannot capture the structure of the data.
    • underfitted models have high bias, but low variance.

A great example is the Library of Babel. It contains every phrase, page, etc. that will ever be written in English. However, it is a most inefficient way of writing beautiful literature.

Testing
To avoid overfitting we can train the model on one data set and test its performance on another (separate) data set that comes from the same true data generating model.

Validation
If we are also optimizing hyperparameters, then the in-between-step of validation makes sense. You then train the initial model on one data set, validate its optimization on another data set and finally test its performance on the last data set.

On real data

Collecting multiple independent data sets to realize a true train/validate/test approach is a costly endeavor that takes up a lot of time and resources.

Alternative: splitting the observed data
We can randomly split the data into 2 parts: one part for training and one part for testing

  • The downside to this approach is that everything becomes split-dependent
    • in theory we could still obtain a good or bad performance because of the split.
    • influential values that are not part of the training data will skew the test performance
    • variance is usually low, but bias can be higher
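A minimal random split (70/30 here; the proportion is a choice) could look like this, reusing the simulated data from before:

```r
# Recreate the simulated data and split it 70/30 at random
set.seed(123)
simulated <- data.frame(discrete = c(rep(0, 50), rep(1, 50)),
                        continuous = c(rnorm(50, 10, 3), rnorm(50, 15, 3)))
idx   <- sample(nrow(simulated), size = 70)   # random 70% for training
train <- simulated[idx, ]
test  <- simulated[-idx, ]

# Train on one part, evaluate on the other
fit  <- glm(discrete ~ continuous, family = binomial, data = train)
pred <- predict(fit, newdata = test, type = "response")
mean((pred > 0.5) == test$discrete)           # test-set accuracy
```

Rerunning with a different seed gives a different split, and hence a (slightly) different accuracy: this is the split-dependence mentioned above.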

Solution: K-fold cross-validation
Partition the data into \(K\) folds and use each fold once as the test set and the remaining \(K-1\) folds as the training set.

  • You’ll have \(K\) estimates of the test accuracy
  • These \(K\) estimates balance bias and variance
  • A special case is when you take \(K = N\). This is called leave-one-out cross-validation

Example

set.seed(123)
library(caret)

# define training control
train_control <- trainControl(method = "cv", number = 3, savePredictions = TRUE)

# train the model on training set
model <- train(as.factor(move) ~ conc,
               data = DAAG::anesthetic,
               trControl = train_control,
               method = "glm",
               family = binomial(link = "logit"))

# print cv scores
model$results
  parameter  Accuracy     Kappa AccuracySD  KappaSD
1      none 0.7919192 0.5737505   0.121414 0.253953

The predictions

model$pred

For fun

